Skip to content

Template job scripts with Jinja#370

Closed
wtbarnes wants to merge 22 commits into
dask:mainfrom
wtbarnes:jinja-templates
Closed

Template job scripts with Jinja#370
wtbarnes wants to merge 22 commits into
dask:mainfrom
wtbarnes:jinja-templates

Conversation

@wtbarnes

@wtbarnes wtbarnes commented Nov 24, 2019

Copy link
Copy Markdown
Member

This PR addresses #133 and is a first pass at using Jinja2 to template job submission scripts. See also #338. In general, the approach I've used here is to store the templates as separate files in templates/ such that there is a separate template for the worker command, the top level job script, and then a separate file for the job header for each scheduler. Rather than appending strings to a list, all of the formatting logic is now contained in the templates using the Jinja syntax.

Thus far, I've modified the top level job script and the worker command to use Jinja2. The following job scheduler headers have been retemplated with Jinja:

  • PBS
  • HTCondor
  • LSF
  • Moab
  • OAR
  • SGE
  • SLURM

Closes #133, closes #321.

@wtbarnes wtbarnes changed the title WIP: Template job scripts with Jinja Template job scripts with Jinja Nov 27, 2019
@guillaumeeb

guillaumeeb commented Nov 27, 2019

Copy link
Copy Markdown
Member

I don't have the time to check this right now, but I just wanted to say thanks for working on this!!

@jacobtomlinson jacobtomlinson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really neat!

@wtbarnes

Copy link
Copy Markdown
Member Author

@guillaumeeb no problem at all. I understand that PRs of this size require a lot of effort on the part of maintainers!

It looks like the failing SGE job is due to problem with installing some packages from apt. Could someone restart that job to see if that fixes things? I don't think the failure is related to anything in this PR.

@lesteve

lesteve commented Nov 28, 2019

Copy link
Copy Markdown
Member

Thanks a lot, as for @guillaumeeb I may not have a lot of time to look at this in detail before some time.

What I would like in an ideal world:

  • fix a few things and release a 0.7.1 version, hopefully in a matter of weeks.
  • review this PR, merge it and maybe release a 0.8, hopefully in January

@lesteve

lesteve commented Nov 28, 2019

Copy link
Copy Markdown
Member

Hmmm I have restarted the build and the SGE error is similar. Probably something to look into ...

@lesteve

lesteve commented Dec 1, 2019

Copy link
Copy Markdown
Member

OK restarting the SGE job a few more times fixed the issue ...

Note this intermittent issue has been seen on master from time to time, e.g. https://travis-ci.org/dask/dask-jobqueue/jobs/617519102.

@lesteve lesteve added this to the 0.8 milestone Dec 1, 2019
@lesteve lesteve mentioned this pull request Dec 1, 2019
@wtbarnes

wtbarnes commented Dec 1, 2019

Copy link
Copy Markdown
Member Author

Thanks @lesteve! I know this is a big PR and will take some time to review.

I currently have access to a PBS cluster and will test this out "in the wild." It'd be great if anyone with access to systems running Condor, LSF, or OAR could give this a go as well. I'm sure there are some hangups related to job script formatting that need to be worked out.

@wtbarnes

Copy link
Copy Markdown
Member Author

Just wanted to check back in and see if anyone has had time to test this out or look it over 🙂 Happy to incorporate any feedback!

If accepted, is the plan still to include this in the 0.8 release and is there a rough timeline for that release?

@guillaumeeb

Copy link
Copy Markdown
Member

Hey @wtbarnes, we're not forgetting you and are again really happy you took a stab at this.

I guess the first planning describe by @lesteve above got a bit delayed, we're probably too busy currently to make the small progress we wanted to achieve. It is thus difficult to give you a timeline as we're already late... We definitely plan to take a closer look at this one for the 0.8 release.

In the mean time and if you feel like it, you could take a look at the few issues identified by @lesteve for a 0.7.1 release and try to fix them? I think there are easy enough to fix (but may be wrong).

@guillaumeeb guillaumeeb left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first review with a couple of questions:

  • Shouldn't we add jinja2 in the requirement file?
  • If a user wants to customize the jinja templates for any reason (which is the final goal here, if a cluster job queuing system has some specific quirk, we want a elegant solution to bypass it), how is that done? Maybe this is already feasible by dropping a new template somewhere, or we need to add a way to override the default templates. In any case this should be documented somewhere in the docs. Can you work on this?

@wtbarnes

Copy link
Copy Markdown
Member Author

I guess the first planning describe by @lesteve above got a bit delayed, we're probably too busy currently to make the small progress we wanted to achieve. It is thus difficult to give you a timeline as we're already late... We definitely plan to take a closer look at this one for the 0.8 release.

Thanks for the quick response @guillaumeeb. I'm not really in any hurry here, just wanted to check in. I know that release timelines are always a bit optimistic 😄, especially when all the work is done by volunteers!

In the mean time and if you feel like it, you could take a look at the few issues identified by @lesteve for a 0.7.1 release and try to fix them? I think there are easy enough to fix (but may be wrong).

Sure, I can try to look these over. These would be just those issues/PRs milestoned for 0.7.1?

@wtbarnes

Copy link
Copy Markdown
Member Author
  • Shouldn't we add jinja2 in the requirement file?

Yes! I meant to do this.

  • If a user wants to customize the jinja templates for any reason (which is the final goal here, if a cluster job queuing system has some specific quirk, we want a elegant solution to bypass it), how is that done? Maybe this is already feasible by dropping a new template somewhere, or we need to add a way to override the default templates. In any case this should be documented somewhere in the docs. Can you work on this?

Jinja allows you to specify multiple possible template loaders. In my current implementation, the templates are only being loading from the package,

https://github.com/wtbarnes/dask-jobqueue/blob/934e403bb81b929458c240e7f1a0aa3a02108fda/dask_jobqueue/core.py#L244

This can be easily modified to accept custom templates that override the default choices,

loader = ChoiceLoader([
    DictLoader(custom_templates),
    PackageLoader("dask-jobqueue", "templates")
])
env = Environment(loader=loader)

where custom_templates is a dictionary containing any templates the user wants to override. If the dictionary is empty, each template will just fall back to those specified in the package.

The bigger question is what is the best entry point for the user to specify these templates? The Jinja environment lives on the Job base class so we would need to work out how best to propagate this information from cluster instantiation to the instantiation of each job.

My personal opinion is that this could probably be its own PR as it is complicated enough and really tackles a separate (though certainly related) issue. However, if people would rather see these changes all merged in at once, I'm happy to give that a go as well.

@guillaumeeb

guillaumeeb commented Feb 21, 2020

Copy link
Copy Markdown
Member

My personal opinion is that this could probably be its own PR as it is complicated enough and really tackles a separate (though certainly related) issue. However, if people would rather see these changes all merged in at once, I'm happy to give that a go as well.

Okay, that sounds fair!

@lesteve

lesteve commented Mar 4, 2020

Copy link
Copy Markdown
Member

@wtbarnes sorry I haven't had time to look at this one indeed ...

This is the kind of PR that is quite hard to review because the diff is quite sizeable. One thing that would make me a bit a lot more relax about this PR would be if you could generate some reference submission scripts (with cluster.job_script()) from master exploring a bit different parameters and compare the submission scripts you get in this PR.

If you manage to get something going along these lines, feel free to post a snippet here!

@lesteve

lesteve commented Mar 4, 2020

Copy link
Copy Markdown
Member

Also just for completeness, my current dask-jobqueue priority is to try to make some progress towards 0.7.1 in the next few weeks!

@guillaumeeb

Copy link
Copy Markdown
Member

I think we should take another look at this one. There's been some little change on the master branch code. Would you be willing to update your PR @wtbarnes?

@wtbarnes

Copy link
Copy Markdown
Member Author

Sorry, I've not had the time to come back to this in a while!

I see there are quite a few conflicts now with several of the different scheduler files. I'm happy to rebase this against the master branch and try to fix those if that is the preferred workflow,

@jacobtomlinson

Copy link
Copy Markdown
Member

@wtbarnes yes please! :)

@wtbarnes

Copy link
Copy Markdown
Member Author

Ok I've rebased this against master and added Jinja2 to the requirements file, but now it seems all of the tests are failing so I've a bit more work to do to get this into good shape.

@lesteve

lesteve commented Jun 1, 2020

Copy link
Copy Markdown
Member

@wtbarnes sorry I introduced a conflict by merging a PR. Here is my current thinking:

  • release a 0.7.2 with the few improvements that went in since 0.7.1. Maybe in a week or two 🤞
  • look at this PR more seriously for 0.8

Sorry this is taking such a long time, but I am afraid this PR is not trivial to review at all and I don't have that much time for the Dask-Jobqueue maintenance these days ...

Something that would make this PR easier to review as I mentioned a few months ago (already? 😓)

One thing that would make me a bit a lot more relax about this PR would be if you could generate some reference submission scripts (with cluster.job_script()) from master exploring a bit different parameters and compare the submission scripts you get in this PR.

@guillaumeeb

Copy link
Copy Markdown
Member

@wtbarnes if you're stiull around, this would be the good time to finish this one!

@wtbarnes

wtbarnes commented Jul 5, 2021

Copy link
Copy Markdown
Member Author

I've just done a rebase against main. The conflicts were not as bad as I suspected and I've tried to make sure all of my changes are still consistent with any changes that have happened on the particular job schedulers since I opened this PR.

I'm not sure what the merge policy is on this repo, but this should probably be squash-merged

EDIT: It looks like there are quite a few test failures here, likely because of things that have not been included in the new jinja templates. @sdtaylor I wonder if your proposed changes will take care of those?

Re: the seemingly extraneous JOB_ID, I'm not sure where that came from or why exactly it seems to show up in so many of the templates.

@jacobtomlinson jacobtomlinson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you don't mind I've pushed in a quick fix for the failing LFS test (template was missing quotes).

I also renamed the templates to have a .j2 file extension which helps with syntax highlighting in many editors and GitHub.

@sdtaylor

sdtaylor commented Jul 6, 2021

Copy link
Copy Markdown
Contributor

for the JOB_ID it looks like those were used when the jinja templates were first made, but have since been removed. So those lines in the template can be taken out. See #396

@sdtaylor

sdtaylor commented Jul 6, 2021

Copy link
Copy Markdown
Contributor

I made some small adjustments here to drop JOB_ID and fix the interface double entry https://github.com/sdtaylor/dask-jobqueue/tree/jinja-templates. I don't have write permissions to this repo so if someone can merge them.

With those changes the current comparison looks pretty good. The only minor thing is various newlines where entries in the jinja template are skipped:

######################
Script mismatch for PBS1:
######################
--- 

+++ 

@@ -1,13 +1,13 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
######################
Script mismatch for MOAB1:
######################
--- 

+++ 

@@ -1,13 +1,13 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
######################
Script mismatch for SGE1:
######################
--- 

+++ 

@@ -1,14 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #$ -N dask-worker
 #$ -q myqueue
 #$ -P myproject
+
 #$ -l h_rt=00:02
+
 #$ -cwd
 #$ -j y
 #$ -x extra-option1
 #$ -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for HTCondor:
######################
--- 

+++ 

@@ -1,18 +1,18 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 MY.DaskWorkerName = "htcondor--$F(MY.JobId)--"
 RequestCpus = MY.DaskWorkerCores
 RequestMemory = floor(MY.DaskWorkerMemory / 1048576)
 RequestDisk = floor(MY.DaskWorkerDisk / 1024)
 MY.JobId = "$(ClusterId).$(ProcId)"
 MY.DaskWorkerCores = 1
 MY.DaskWorkerMemory = 2000000000
 MY.DaskWorkerDisk = 1000000000
 
+
 Environment = ""
 Arguments = "-c '/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://'"
 Executable = /bin/sh
 
 Queue
-
######################
Script mismatch for OAR1:
######################
--- 

+++ 

@@ -1,12 +1,11 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #OAR -n dask-worker
 #OAR -q myqueue
 #OAR --project myproject
 #OAR -l /nodes=1/core=1,walltime=00:02
 #OAR -x extra-option1
 #OAR -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for LSF1:
######################
--- 

+++ 

@@ -1,14 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #BSUB -J dask-worker
+
 #BSUB -q myqueue
 #BSUB -P "myproject"
 #BSUB -n 1
+
 #BSUB -M 1
 #BSUB -W 00:02
 #BSUB -x extra-option1
 #BSUB -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for SLURM1:
######################
--- 

+++ 

@@ -1,15 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #SBATCH -J dask-worker
+
 #SBATCH -p myqueue
 #SBATCH -A myproject
 #SBATCH -n 1
 #SBATCH --cpus-per-task=1
 #SBATCH --mem=2G
 #SBATCH -t 00:02
 #SBATCH -x extra-option1
 #SBATCH -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-

@wtbarnes

wtbarnes commented Jul 8, 2021

Copy link
Copy Markdown
Member Author

I've removed the JOB_ID from all of the templates. Are the newline differences important in any of these submission scripts? My experience is limited to SLURM and PBS and as far as I know neither of them care too much about extraneous newlines.

@guillaumeeb

Copy link
Copy Markdown
Member

Thanks everyone for reviving this one!!

I've got two questions:

  • Is it in a state to review?
  • One of the goal of this is to make it easy to customize job_script to adapt to specific job scheduler setup. How can one change a jinja template for a given scheduler?

@wtbarnes

wtbarnes commented Jul 13, 2021

Copy link
Copy Markdown
Member Author

Is it in a state to review?

I believe so though I would defer to @sdtaylor or @lesteve as well

One of the goal of this is to make it easy to customize job_script to adapt to specific job scheduler setup. How can one change a jinja template for a given scheduler?

I think this old comment is still relevant: #370 (comment). It is very easy to modify the Jinja Environment to accept custom templates (that can then override the default templates). However, I'm still not sure what the ideal entrypoint would be for these custom templates. My original thinking was that this PR would simply reproduce the current functionality and then a separate PR would add the ability to supply custom templates.

@sdtaylor

Copy link
Copy Markdown
Contributor

Yes, I feel it's ready for review.

@guillaumeeb guillaumeeb left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I have some comments on potentially simplifying things, or relying more on templates.

Comment thread dask_jobqueue/htcondor.py
return env


def env_lines_to_dict(env_lines):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be done directly in the template file?

Comment thread dask_jobqueue/lsf.py
return result


def set_ncpus(ncpus, worker_cores, logger):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for LSF, couldn't we code this in the template file?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible though the templating logic can easily get messy and it is often easier and cleaner to offload more complicated logic to filters (which is exactly what is done here). Additionally, I don't think you can easily log things from within a Jinja template.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking of removing the log, which don't seem really useful to me.

But actually, in this case, we could have kept the ncpus and mem settings in the __init__ method, don't we?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree regarding ncpus and mem. They're so simple they should just be in the __init__.

The format_memory function should probably remain as a filter because of the complexity in lsf_format_bytes_ceil and lsf_detect_units

Comment thread dask_jobqueue/pbs.py
return "%dB" % n


def pbs_format_resource_spec(resource_spec, worker_cores, worker_memory):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for PBS, in the template file?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment on the LSF templating/filtering

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just giving it a try here. I think in the template it would look something like:

{% if resource_spec is not none -%}
#PBS -l {{ resource_spec }}
{% else -%}
#PBS -l select=1:ncpus={{ worker_cores }}:mem={{ worker_memory | pbs_format_bytes_ceil }}
{% endif -%}

I like that because it's clear with a single look at the template file what is done. Thoughts?

@wtbarnes wtbarnes Sep 10, 2021

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! That's much more simple. And then pbs_format_bytes_ceil would just need to be added as a filter.

I wonder if we should filter on worker_cores that ensures that it is an integer?

#PBS -l select=1:ncpus={{ worker_cores | round | int }}:mem={{ worker_memory | pbs_format_bytes_ceil }}

Comment thread dask_jobqueue/tests/test_htcondor.py
@wtbarnes

wtbarnes commented Feb 8, 2023

Copy link
Copy Markdown
Member Author

Unfortunately I don't have much time these days to invest in dask-jobqueue and am not using it in my work any longer so I'm going to close this PR. Furthermore, it was opened so long ago and has so many conflicts now that it isn't even clear to me if this is useful any longer.

If anyone would like to revive this, please feel free to reopen.

@wtbarnes wtbarnes closed this Feb 8, 2023
@lesteve

lesteve commented Feb 9, 2023

Copy link
Copy Markdown
Member

Thanks a lot for this work which definitely looked useful and sorry I never managed to find the time to review it fully 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

all job schedulers enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use configurable templated header with jinja2 as in batchspawner Provide custom header

6 participants